INTRODUCTION

I am currently an undergraduate student about to move back to Seattle, WA after graduation. One of the major areas of environmental research deal with fish species and populations. I wanted to research more about the fish populations to get a better understanding of the marine and freshwater species near the area. I am currently looking into jobs that require knowledge of fish species in the area and thought this would be a great start. I found a research project on the recolonization of the Pacific Salmon in Cedar River on the Northwest Fisheries Science Center-NOAA Fisheries website. Peter Kiffney is the project investigator studying population, community and ecosystem changes of adult salmon and steelhead fish but the project has also included multiple different types of fish in one of their datasets. I found it an ideal data set to look at fish types in different habitats and see if there are trends in habitat and species type.

WORKING DIRECTORY

Let’s set our working directory, import our data sets,and name them something easy.

-#setwd(“~/MARE353Project1/Midterm_Books”) -install.packages(readr)

library(readr)
#naming the fish abundance data with fish
fish <- read_csv("~/MARE353Project1/Midterm_Books/Fish abundance - Recolonization of the Cedar River by Pacific Salmon 2000-2015 Data.csv")
## Rows: 1813 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Stream, Season, Record Number
## dbl (33): Year, Reach, Trt.Size1, Trt.Size2, Trt.Size3, Trt.Size4, Trt.Size5...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#naming the fish habitat data with hab
hab <- read_csv("~/MARE353Project1/Midterm_Books/Habitat- Recolonization of the Cedar River by Pacific Salmon 2000-2015 Data.csv")
## Rows: 1813 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Stream, Season, Main.Unit.#, Subunit.Num, Habitat.Type, Record Number
## dbl (8): Year, Reach, Unit.Area.M2, Max.Depth.M, Tail.Depth.M, Avg.Depth.M, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

LOOKING AT THE DATA

To get an idea of what variables are contained in the sets let’s take a look

library(psych)
str(fish)
str(hab)
#results are hidden due to length of output

-View(fish)

-View(hab)

Fish The fish abundance has 36 variables with 1,813 observations.

Mostly comprised of numerical variables such as fish type counts.

It does not appear to be missing data visually scrolling and included 0’s in the count, which is great for us.

Habitat The fish habitat has 14 variables with 1,813 observations.

The habitat data is missing a lot of values indicated by NA’s autofilled into the dataset. The main columns missing large amounts of values are the Tail Depth, Average Depth, Residual Depth and Average Velocity. Due to the fact that so many are missing from these columns this habitat data will not be used.

METADATA

The comprehensive list of metadata can be found here: [https://www.fisheries.noaa.gov/inport/item/35546]

Environmental: -The data included sections of the river from above Landsberg Dam to Cedar Falls called Reaches. They defined the reaches based on channel gradient, valley confinement, and dominate substrate numbering them 1-10.

-habitat types are pools(POOL), riffles(RIFFLE), runs(RUN), step pools(SP), cascades(CASCADE), deposition(DEP), and side channels (SC).

-Measurements of Habitats are taken in meters.

Fish:

They often broke down the fish type and size(millimeters):

Size: 1= < 90

2= > 91 < 190

3= > 191 < 250

4= > 250 < 330

5= > 331

Type:

Trt= Unidentified Trout

Rb= Rainbow Trout

Steelhead= Stealheed

Cutt= Cutthroat Trout

Juvcoho= Juvenile Coho

Juvchinook= Juvenile Chinook

Sockeye= Sockeye Salmon

Mwf= Moutnain whitefish trout

Sculipin= Sculipin

Lamprey= Lamprey

Charr = Bull trout

Examples: Trt.Size1 = Unidentified Trout > 90 mm in length

PUTTING DATA TOGETHER AND TIDYING

In order to analyze the Fish and Habitat together we need to combine the datasets and making them easy to process using the dplyr package. The dplyr package requires you to have very clean aka “tidy” data.

Install and turn on Packages:

-‘install.packages(tidyverse)’

-‘install.packages(dplyr)’

-‘install.packages(datarium)’

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.7
## ✓ tibble  3.1.4     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x ggplot2::%+%()   masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()

Let’s look for some primary keys:

fish %>% count(`Record Number`) %>% filter(n >1)
## # A tibble: 0 × 2
## # … with 2 variables: Record Number <chr>, n <int>
hab %>% count(`Record Number`) %>% filter(n >1)
## # A tibble: 0 × 2
## # … with 2 variables: Record Number <chr>, n <int>

They already had a record number so let’s just see if there was anything else that could be a primary key

fish %>% count(Season, Reach, Year) %>% filter(n>1)
## # A tibble: 128 × 4
##    Season Reach  Year     n
##    <chr>  <dbl> <dbl> <int>
##  1 Summer     0  2007     8
##  2 Summer     0  2008     3
##  3 Summer     0  2009     6
##  4 Summer     0  2010     8
##  5 Summer     0  2011     6
##  6 Summer     0  2012    10
##  7 Summer     1  2000    16
##  8 Summer     1  2001     5
##  9 Summer     1  2004     8
## 10 Summer     1  2005    43
## # … with 118 more rows

These were the only other data matching between the two and did not work so we will stick with Record Number as both data sets have the Record Numbers

LEFT JOINING THE DATA

#Selecting the variables we are going to study in the habitat data and joining with the fish abundance data. 
hab1 <- hab %>% select ('Record Number', Max.Depth.M, Unit.Area.M2, Habitat.Type)
fh1 <- fish %>% left_join(hab1)
## Joining, by = "Record Number"

-View(fh1)

As there are still a good chunk of values missing for certain habitats lets remove the values rather than perform imputation. It is a safer bet in this case.

This gives us TRUE and FALSES. TRUE indicates that there is no NA value in that row and FALSE indicates an NA in that row.

Now we will find which rows have values (complete.cases) and which rows don’t have values(!complete.cases). - ‘which(complete.cases(fh1))’ - ‘which(!complete.cases(fh1))’

Now we will remove the rows within complete.cases on a vector. na_vec <-which(!complete.cases(fh1)) na_vec

Then we create a new dataset by eliminating the vector data from the FH1 fh2 <-fh1[-na_vec,]

complete.cases(fh1)

which(complete.cases(fh1))
which(!complete.cases(fh1))

na_vec <-which(!complete.cases(fh1))
na_vec

fh2 <-fh1[-na_vec,]
##output too large so it is hidden

IMPUTATION JUST FOR FUN

-install.packages(mice)

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
md.pattern(fh1)

mice_plot <- aggr(fh1, col=c('darkgray','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(fh1), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

##It looks like there are only NA's in the Max Depth Column. 

##Theoutput was large and cramped and thus results were hidden

##Some of the packages hate '.' and spaces, lets replace our headers with new ones that disclude those. Don't be me and not check your headers before uploading because this would have been much easier to edit in my CSV.

fh3 <- fh2 %>% rename (TrtSize1 ='Trt.Size1',   TrtSize2 ='Trt.Size2',  TrtSize3 ='Trt.Size3',  TrtSize4 ='Trt.Size4',  TrtSize5 ='Trt.Size5',  RbSize1 = 'Rb.Size1',   RbSize2 ='Rb.Size2',    RbSize3 = 'Rb.Size3',   RbSize4 = 'Rb.Size4',   RbSize5 ='Rb.Size5', SteelheadAdult= 'Steelhead.Adult', CuttSize1 = 'Cutt.Size1',   CuttSize2 = 'Cutt.Size2',   CuttSize3 = 'Cutt.Size3',   CuttSize4 = 'Cutt.Size4', CuttSize5 =   'Cutt.Size5',   SockeyeAdult = 'Sockeye.Adult', MwfSize1 = 'Mwf.Size1', MwfSize2 = 'Mwf.Size2', MwfSize3 = 'Mwf.Size3', MwfSize4 =  'Mwf.Siz4', MwfSize5 = 'Mwf.Szie5', SculpinSize1 = 'Sculpin.Size1', SculpinSize2 = 'Sculpin.Size2', LampreySize1 = 'Lamprey.Size1', LampreySize2 =  'Lamprey.Size2', CharrSize3 =   'Charr.Size3', CharrSize4 =     'Charr.Size4', CharrSize5 =     'Charr.Size5',  Record_Number = 'Record Number' , Habitat_Type = 'Habitat.Type' , UnitArea =    'Unit.Area.M2', MaxDepth = 'Max.Depth.M')
##View(fh3)
fh4 <- fh3 %>% select(MaxDepth,Record_Number)

## Let's imputate

fh4i <- mice(fh4, m = 5, maxit = 50, method = 'pmm', seed = 500)
## Warning: Number of logged events: 1
fh4i$imp$MaxDepth


fh4i<- complete(fh4i,2)

##View(fh4i)

Data was filled in for all of the Max Depth, data was hidden due to long outputs.

Now let’s re-examine our data

-‘View(fh3)’

-‘str(fh3)’

We now have 39 variables and 1,588 observations. All of the column’s have tidy names and there are no missing values.

#Plotting

#lets combine sizes of fish together and see if there are any patterns
fs1 <- c(fh3$TrtSize1+fh3$RbSize1+fh3$CuttSize1+fh3$MwfSize1+fh3$SculpinSize1+fh3$LampreySize1+fh3$TrtSize2+fh3$RbSize2+fh3$CuttSize2+fh3$MwfSize2+fh3$SculpinSize2+fh3$LampreySize2+fh3$TrtSize3+fh3$RbSize3+fh3$CuttSize3+fh3$MwfSize3+fh3$CharrSize3+fh3$TrtSize4+fh3$RbSize4+fh3$CuttSize4+fh3$MwfSize4+fh3$CharrSize4+fh3$TrtSize5+fh3$RbSize5+fh3$CuttSize5+fh3$MwfSize5+fh3$CharrSize5)
hist(fs1)

shapiro.test(fs1)
## 
##  Shapiro-Wilk normality test
## 
## data:  fs1
## W = 0.6902, p-value < 2.2e-16

The data do not look normal let’s try to transform them.

-install.packages(vegan) and turn on

## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
frel <- decostand(fs1, "total")
shapiro.test(frel)
## 
##  Shapiro-Wilk normality test
## 
## data:  frel
## W = 0.46963, p-value < 2.2e-16
fnorm <- decostand(fs1, "normalize")
shapiro.test(fnorm)
## 
##  Shapiro-Wilk normality test
## 
## data:  fnorm
## W = 0.46963, p-value < 2.2e-16
fhel <- decostand(fs1, "hellinger")
shapiro.test(fhel)
## 
##  Shapiro-Wilk normality test
## 
## data:  fhel
## W = 0.46963, p-value < 2.2e-16
fchi <- decostand(fs1, "chi.square")
shapiro.test(fchi)
## 
##  Shapiro-Wilk normality test
## 
## data:  fchi
## W = 0.46963, p-value < 2.2e-16

It appears our data can’t be transformed to normalize it. Which is surprising due to hellinger being a good fit for ecoinformatics. There may be other factors at play.

Let’s look at a model for fun.

MODELS

fs1m <- lm(fs1 ~ MaxDepth + UnitArea, data = fh3)
#summary(fs1m)

ggplot(data = fs1m, aes(x = UnitArea, y = MaxDepth)) + geom_point() + geom_smooth(color = "red") + labs(title = "Unit Area vs Max Depth", y = "Max Depth", x = "Unit Area") + theme_bw()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Citation: Kiffney, Peter; 07/13/2016. NOAA Fisheries Northwest Fisheries Science Center. Fish abundance, composition, distribution: Recolonization of the Cedar River by Pacific Salmon 2000-2015 (https://www.webapps.nwfsc.noaa.gov/apex/parrdata/inventory/tables/table/recolonization_of_the_cedar_river_by_pacific_salmon_20002015)